Mon. Not. R. Astron. Soc. 000, [T|fTol (2009) Printed 30 March 2010 (MN WT&i style file v2.2) 



Bayesian model comparison in cosmology with Population 
Monte Carlo 

Martin Kilbinger 1 ' 2 *, Darren Wraith 3 ' 1 , Christian P. Robert 3 , Karim Benabed 1 , 
Olivier Cappe 4 , Jean- Francois Cardoso 4 ' 1 , Gersende Fort 4 , Simon Primet 1 , Frangois 
R. Bouchet 1 

1 Institut d 'Astrophysique de Paris, UMR 7095 CNRS & Universite Pierre et Marie Curie, 98 bis boulevard Arago, 75014 Paris, France 

2 Shanghai Key Lab for Astrophysics, Shanghai Normal University, Shanghai 200234, P- R- China 

3 CEREMADE, Universite Paris Dauphine, 75016 Paris, France 

4 LTCI, Telecom ParisTech and CNRS, 46, rue Barrault, 75013 Paris, France 

30 March 2010 

ABSTRACT 

We use Bayesian model selection techniques to test extensions of the standard flat 
ACDM paradigm. Dark-energy and curvature scenarios, and primordial perturbation 
models are considered. To that end, we calculate the Bayesian evidence in favour of 
each model using Population Monte Carlo (PMC), a new adaptive sampling technique 
which was recently applied in a cosmological context. In contrast to the case of other 
sampling-based inference techniques such as Markov chain Monte Carlo (MCMC), the 
Bayesian evidence is immediately available from the PMC sample used for parame- 
ter estimation without further computational effort, and it comes with an associated 
error evaluation. Also, it provides an unbiased estimator of the evidence after any 
fixed number of iterations and it is naturally parallelizable, in contrast with MCMC 
and nested sampling methods. By comparison with analytical predictions for simu- 
lated data, we show that our results obtained with PMC are reliable and robust. The 
variability in the evidence evaluation and the stability for various cases are estimated 
both from simulations and from data. For the cases we consider, the log-evidence is 
calculated with a precision of better than 0.08. 

Using a combined set of recent CMB, SNIa and BAO data, we find inconclusive 
evidence between flat ACDM and simple dark-energy models. A curved universe is 
moderately to strongly disfavoured with respect to a flat cosmology. Using physically 
well-motivated priors within the slow-roll approximation of inflation, we find a weak 
preference for a running spectral index. A Harrison-Zel'dovich spectrum is weakly 
disfavoured. With the current data, tensor modes are not detected; the large prior 
volume on the tensor-to-scalar ratio r results in moderate evidence in favour of r = 0. 
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1 INTRODUCTION 

We have reached an era of precision cosmology, as impressive 



what Peebles (8037) called 'accurate cosmology'. This next 



constraints on cosmological parameters attest (e.g. Dunkley 
et al.|2009 among many others) . Parameters of the standard 
ACDM model arc measured with uncertainties of a few per- 
cent. At the same time, we have not made the transition to 
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and qualitatively different step involves the scrutiny of the 
underlying model rather than the ever more precise deter- 
mination of model parameters. This approach is particularly 
important in the field of cosmology which relies on a con- 
siderable extrapolation of known physics to large scales and 
high energies (in the early Universe), and lacks physical un- 
derstanding, e.g. of the dark sector. Subsequently, cosmology 
has spawned a multitude of different models. 
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There are several statistical approaches for comparing 
competing models. Most of them aim at a balance between 
the ability of a model to fit observational data, and its com- 
plexity. While information theory approaches like Akaike's 



criterion (AIC, [Akaike 19741 provide explicit penalisations 
for the complexity of a model, Bayesian analysis compares 
directly the posterior probabilities of models, in favour of 
each of one of the given models. In that sense, Bayesian 
analysis has often been argued to propose an automated Oc- 
cam's razor, see e.g. |Berger fc Jeffreys| ( [19921 ), or |MacKay 
( 2002 1 . Following the classical Bayesian approach ( Jeffreys 



1939), the comparison between models integrates out the 



parameters within each model and automatically penalises 
larger parameter spaces. 

From a practical viewpoint, especially for high- 
dimensional parameter spaces, the calculation of the evi- 
dence is very challenging. While fast approximations exist, 
such as the Bayesian Information Criterium (BIC, Schwarz 
1978), Laplace (see e.g. Heavens et al.||2007), or variational 



Bayes (e.g. MacKay 20021, they can fail dramatically for 



posterior distributions which are not well approximated by 
a multivariate Gaussian. 

Recently, a new adaptive importance sampling method 
called Population Monte Carlo (PMC) was introduced 
I Cappe et al. 2004 , 2008 ) and successfully tested in a cosmo- 
logical context (Wraith et al.|2009| hereafter WKB09). PMC 



has since been used for a range of applications in cosmology 
( Benabed et al.|2009||Menard et al.|2009||Schrabback et al. 



2010) 



WKB09 focused on parameter estimation with the 
PMC sampling algorithm. In this second paper, we use the 
PMC method to estimate the Bayesian evidence and assess 
the accuracy and reliability of this estimate. We emphasize 
that the same set of sampled values used for parameter esti- 
mation can also be used to calculate the Bayesian evidence. 
Thus with the PMC method, model selection comes at the 
same computational cost as parameter estimation. 

This paper is organised as follows: We describe the ba- 
sics of the PMC algorithm in Sect. [2] Section [3] assesses 
the performance and reliability of PMC to calculate the 
Bayesian evidence using numerical simulations. In Sect. [4] 
we use PMC to compare cosmological models in the context 
of dark energy and primordial perturbations. Our findings 
are summarised in Sect. [5] 



2 BAYESIAN MODEL SELECTION WITH 
POPULATION MONTE CARLO 

2.1 Bayesian evidence and Bayes' factor 

The Bayesian evidence E in favour of a model 9JI with like- 
lihood function L and prior distribution P is the average 
of the likelihood function, weighted by the prior, over the 
parameter space 



E 



L(x)P(x)dx 



n(x) dx. 



(1) 



where n(x) — L(x)P(x) is the unnormalised posterior. We 
denote the normalised posterior density by tt(x), the nor- 
malisation constant being the evidence, ti(x) — -k(x)/E. 

The Bayes factor ([Jeffreys 1939]) is the ratio of the ev- 
idences Ei and E2 for two competing models 9Jti and OT2, 



Table 1. Jeffrey's scale to quantify the 'strength of evidence' for 
a corresponding range of the Bayes factor B12 in assuming 
B12 > 1. 



ln_Bi2 B12 



Strength 



< 1 < 2.7 

1...2.5 2. 7... 12 

2. 5... 5 12... 150 

> 5 > 150 



respectively, 



Inconclusive 
Weak 
Moderate 
Strong 



Si 



Ei 
E2 



(2) 



If B12 is larger (smaller) than unity, the data favour model 
2Jti (3JI2) over the alternative model. To quantify the 
'strength of evidence' contained in the data, [Jeffreys ( 1961 1 
introduced an empirical scale, see Table [l] This is only a 
rough guideline for decision making and of course the pro- 
posed boundaries are mostly subjective. For a comprehen- 
sive review of Bayesian model selection, we refer the inter- 
ested reader to |Chen et aT[ ( |2000| ) and |Trotta| ( |2008[ ). 



2.2 A note on priors 

The Bayesian evidence depends crucially on the prior distri- 
bution on the parameters. Firstly, in the Bayesian frame- 
work, the prior is an integral part of the model, since 
Bayesian inference automatically yields the updated results 
with respect to prior knowledge. Secondly, the concept of 
predictability or complexity of a model makes sense only 
in comparison with the model prior; this is the core of the 
Lindley- Jeffrey paradox (Lindlcy 1957) which is illustrated 
nicely in Trotta (20071. In short, a model is penalised if it 



requires fine-tuning of parameters, corresponding to a pos- 
terior distribution that is very concentrated in terms of the 
prior mass. I Efstathiou|2008 1. As Efstathiou (20081 pointed 
out, the lack of a physically well- motivated model and there- 
fore the choice of an ad-hoc prior will strongly decrease the 
usefulness of a model selection analysis. Since fundamental 
physical understanding is often lacking in cosmology, the 
application of the Bayesian evidence might indeed be lim- 
ited. However, we should not consider this as a fallacy of 
Bayesian inference but rather take it as a motivation to find 
well-defined physical models which can be compared in a 
sensible way ( |Trotta||2008[ ). 



2.3 Estimating the Bayesian evidence with 
importance sampling 

We propose to estimate the evidence using importance sam- 
pling (IS). IS provides a converging approximation of the 
integral |l] as follows. For a probability density function 
q whose support includes that of the posterior n, we can 
transform |l]) as 



E 



n(x) dx 



ty(x) 
q{x) 



q(x) dx. 



(3) 
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IS performs a Monte-Carlo integration of E by drawing N 
samples xi,...xn from the importance function q to ap- 
proximate E with the sample average 



E ; 



1 



7r(x„) 
<?0n) ' 



(4) 



where the «i„ are called (unnormalised) importance weights. 
For later use, we introduce their normalised counterparts, 
marked with a bar, 

Wn = — Jf . (5) 

Z^m = l Wm 

Of course, the quality of the estimate Q of E depends on 
the choice of the importance function q. The variance of g 



o- E = —d (ir\\q), 
with the so-called chi-square distance 

= 2 (x) 



d: 



dx- 1. 



(6) 



(7) 



Therefore, a suitable choice of q is such that d 2 (-7f||g) is as 
small as possible. In practice, this means that for each spe- 
cific problem (i.e., each specific tt), efficient importance func- 
tions have to be found, which is a non-trivial task. Hence the 
appeal of adaptive IS methods which rely on numerical op- 
timization schemes that can automatically select a suitable 
q. In the next section we discuss an algorithm based on an 
alternative measure of fit between tt and q, for which closed 
adaptation expressions can be devised for q being a mixture 
of Gaussian or Student-t distributions. 



2.4 Adaptive importance sampling 

Population Monte Carlo tackles the problem of the impor- 
tance function choice by an adaptive solution: The PMC 
algorithm produces a sequence q l , t — 1, ... ,T of impor- 
tance functions aimed at progressively approximating the 
posterior tt. 

The quality of approximation is measured in terms of 



the Kullback-Leibler divergence (Kullback & Leibler 1951 
|Cover fc Thomas|1991| ) from the posterior, 



K{H\\q 



Tt(x) 

q'(x) 



7r(x) dx, 



(8) 



rather than the chi-square distance Q, and the density g* 
can be adjusted incrementally to minimize this divergence. 
Note that the optimisation algorithm is independent of the 
normalisation of the posterior. Therefore, for all practical 
purposes, the unnormalised posterior tt can be used in (JsJ) . 

The importance function should be selected from a fam- 
ily of functions which is sufficiently large to allow for a close 
match with tt but for which the minimization of Q is com- 
putationally feasible. Cappe et al. ( 2008 ) propose to use mix- 



ture densities of the form 



q\x) = q(x\a t ,6 t ) = ^ ct d (p(x; 6* d 



(9) 



where a' = (a*, . . . , a^) is a vector of adaptable weights for 
the D mixture components (with a d > and Ysd=i a d ~ •"•)) 



and Q t = (6\ , . . ■ , 6%) is a vector of parameters which spec- 
ify the components; ip is a parametrized probability den- 
sity function, usually taken to be multivariate Gaussian or 
Student-t. This choice of the importance function is very 
flexible and allows to approximate a wide range of posteri- 
ors. 

The first sample in this adaptive scheme is produced by 
a regular importance sampling mechanism, x\ , 
associated with importance weights 



, xjy 



1,. 



,N. 



(10) 



providing a first approximation to a sample from tt. 

At each stage t of the iteration, the sample points and 
weights from that iteration, (x', w\), . . . , (xJv, w%), are used 
to update the importance function q l to the new one, q . 
During the next iteration, a new sample x^ +1 , . . . , x^ 1 is 
then drawn from the updated importance function q t+1 ■ 

The updating method is based on a variant of the 



Expe ctation-Maximization algorithm (EM, Dempster et al 
19771. New parameters a t+1 ,8 t+1 are obtained by carrying 



out an IS approximation of the update of the EM algorithm 
to obtain a reduction in the Kullback-Leibler divergence Q . 

As an illustration of a simple updating rule, the new 
weights of the mixture components can be calculated ac- 
cording to 



^2 wi IdiXn) 



(11) 



Here, l d {x) denotes the d th -component indicator function 
which is unity if x has been drawn from component d, and 
zero otherwise. The updated component weight a a +1 is thus 
the sum of the normalized importance weights of the sample 
points drawn from the d th component, leading to a simple in- 
tuition of the adaptation. Points sampled from a component 
which approximates the posterior well (badly) have large 
(small) weights, and the component will be up- weighted 
(down-weighted) in the update. Note that in our implemen- 
tation of PMC we use improved and more robust versions 
of this updating rule. Details of the algorithm as well as 
formulae and their derivations for updating the mean and 
covariance in the case of Gaussian and Student-t mixtures 



are given in Cappe et al. ( 2008 1 and WKB09 



2.5 Diagnostics 

Although a formal stopping rule for the above described it- 
erative process does not exist, performance measures can be 
defined to serve as guidelines. As PMC aims at minimiz- 
ing the Kullback-Leibler divergence K (JsJ) across iterations, 
one can stop the process when subsequent importance func- 
tions do not yield a significant decrease of K. We estimate 
exp[— 7f(-7f || g*)] by the perplexity 



p = exp{Hh)/N, 



where 



(12) 



(13) 



is the Shannon entropy of the normalised weights. Values 
of p close to unity will therefore indicate good agreement 
between the importance function and the posterior. 
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Another frequently used criterion for importance sam- 
pling is the so-called effective sample size (ESS), 



ESSf, 



( N 

\n=l 



(14) 



with 1 ^ ESSn ^ N. The effective sample size can be inter- 
preted as the number of sample points with non-zero weight 
(Liu &Chen| 19951). Both measures (112111 41) are related, as an 



importance function which is close to the posterior density 
will in general have both a high perplexity and a relatively 
large number of points with non-zero weight, compared to 
an ill-fitting importance function. 



2.6 The initial proposal 

The efficiency of the algorithm is dependent on the initial 
choice of the proposal. A poor initial importance function, 
e.g. a single-mode function in the case of a multimodal pos- 
terior or a too narrow function with light tails, may take a 
very long time to adapt or even miss important parts of the 
posterior. For importance sampling the choice of q requires 
both fat tails and a reasonable match between q and the 
posterior ir in regions of high density. Such an importance 
function can be more easily constructed in the presence of a 
well-informed guess about the parameters and possibly the 
shape of the posterior density. In our application of PMC to 
cosmology (Sect. [4]) we will use the Fisher matrix as an aid 
for the initial proposal. 




After 1 0th iteration 



After 1 0th iteration 



Figure 1. Top panels: The posterior distribution | |16| l for = 
0.03, with the true 68.3% (blue), 95% (green), and 99.7% (black) 
density contours (left) and M = 100 simulated data points drawn 
from fjl6| , both shown in the first two dimensions. Bottom panels: 
Estimates over 100 simulation runs after the 10 th iteration of /3 
(left), and the Bayesian evidence (right). The distributions are 
shown as whisker plots: the thick horizontal line represents the 
median; the box shows the interquartile range (IQR), containing 
50% of the points; the whiskers indicate the interval 1.5xIQR 
from either Ql (lower) or Q3 (upper); points outside the interval 
[Q1,Q3] (outliers) are represented as circles. Posterior means of 
E and /? from the simulated data are indicated as dashed lines. 



2.7 Summary 

PMC offers a fast and reliable way to estimate the Bayesian 
evidence, see eq. (j4j), with an expression of the variance of 
this estimator provided by j6j. Diagnostics of the reliability 
of the sampling are at hand. The evaluations of the like- 
lihood function can further be massively parallelized with 
importance sampling, offering an enormous decrease of the 
required wall-clock time to obtain the evidence. This fea- 
ture is almost unique to importance sampling techniques and 
thus not partaken by alternative techniques such as nested 
sampling and MCMC. We stress again that the calculation 
of the evidence comes at virtually no further computation 
cost than parameter estimation using PMC. 



3 SIMULATIONS 

In this section we use simulated data and a toy model for 
the posterior density to assess the performance of the PMC 
approach to provide an accurate estimate of the evidence. 
We use a non-Gaussian posterior density, twisting a centred 
d-dimensional multivariate Gaussian in the first two dimen- 
sions. One can easily build a sample under this twisted pos- 
terior distribution, using 



(xi,x 2 , 



■ A/d(0, E), 



(15) 



where E = diag(<r 1 ,l, 
forming it to 



1) is the covariance, and trans- 
(x'^x'2 - Pix'j 2 ~ o-f),x' 3 , . . -,x' d ). (16) 



The twist parameter j3 controls the degree of curvature. This 
example was also used in WKB09 to assess the performance 
of PMC for parameter estimation. In the following, we will 
use d = 5 and a\ = 100. 



3.1 Unknown twist f3 

As a first benchmark, our interest is in estimating /3 and 
in obtaining the evidence E by integrating out j3 from the 
unnormalised posterior distribution, i.e. 



E : 



(17) 



As /3 is a scalar we can easily calculate (using a grid-point or 
adaptive quadrature approach) the evidence by integrating 
over this one-dimensional domain. 

For this example we take a simulated data set x with 
size M = 100, and input twist ft — 0.03. Fig. [T] shows 
the confidence contours of the posterior density defined by 



eq. ( 16 1 (top left panel) and the simulated data points (top 
right) in the first two dimensions. From eq. ( |16[ ), the likeli- 
hood is defined as 



m = n 



7 exp(- 



0.5 x m S 



(18) 



I (2 7 r) d / 2 |E| 1 /2 

where each x m = (x m i,x m2 + /3(»mi — Oi),Xm3, ■ ■ ■ ,x md ) 
is a transformed sample point according to (16 1. Note that 



the Jacobian of the transformation is unity. The prior for ft 
is uniform on the unit interval, P(/3) = U(0, 1). 
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The initial importance function q 1 was chosen to be 
a mixture of three Gaussian distributions with means ran- 
domly displaced around zero and variance 0.5, allowing for a 
somewhat vague initial coverage of the parameter space. For 
the results to follow the number of points at each PMC iter- 
ation was TV — 1000, with a number of PMC iterations equal 
to T — 10. To assess the variability and distribution of the 
results we repeated this process 100 times. Fig.[I]shows the 
estimates of f3 (bottom left) and the evidence (bottom right) 
over the 100 simulation runs after the 10 th iteration. We find 
for the mean and 68% confidence /3 = 0.02998 ± 0.00002 
and InE = -264.0342 ± 0.0010. Comparing these results 
to the posterior mean values corresponding to the sample 
of M = 100 simulated data points, /3 = 0.0299862 and 
InE = —264.0344 suggests that PMC performs very well 
in providing an accurate estimate of the evidence and of /3. 
If we use only TV — 1 000 sample points per iteration, we still 
get stable and reliable results, but with a fourfold increase 
of the error bar of InE. 



3.2 Known twist /3 

For the second case, we use the slightly twisted centred 
multivariate Gaussian as before but this time f3 is known, 
ft — 0.03 as before, and we integrate over the posterior rep- 
resented by ||16[|, i.e. 



E = / Ti(x\P,T,)dx 



(19) 



is our target. This presents a much more difficult exercise 
with the tails of the posterior being a significant challenge 
to capture, and the dimension of the space to be integrated 
over is d = 5. To estimate the evidence with PMC in this 
second example, the number of points at each iteration was 
TV = 10000, the number of PMC iterations T = 10 and as 
in the first example we use 100 simulation runs to assess 
the variability of the estimates. After the 10 th iteration we 
draw a final sample of size TV = 100, 000. The initial im- 
portance function q is a mixture of multivariate Student-t 
distributions with components displaced randomly in differ- 
ent directions slightly away from the centre of the range for 
each variable: the mean of the components is drawn from a 
p-multivariate Gaussian with mean and covariance equal 
to Eo/5 where Eo is the covariance of the proposal compo- 
nents. We choose a mixture of 9 components of Student-t 
distribution with v = 9 degrees of freedom; and Eo is a di- 
agonal matrix with diagonal entries (200, 50, 4, . . . , 4). This 
choice of (y, Eo) ensures adequate coverage, albeit somewhat 
overdispersed, of the feasible parameter region. 

Fig. [2] shows the estimates of the evidence (bottom 
right) against the true value over the 100 simulation runs 
after the 10 th iteration. 

The results for this more difficult case suggest that PMC 
performs reasonably well in providing an accurate deter- 
mination of the evidence. The estimates at each iteration 
(Fig. [2| bottom left) are stable, with a reduction in the vari- 
ability seen as the importance function better adapts to the 
posterior density. The adaptation performance can be seen 
by the increase in estimates of the normalized perplexity and 
effective sample size (Fig. [2J top right) for successive itera- 
tions. See fig. 1 of WKB09 for a graphical example of the 
adaptation of the importance function to the posterior for 








•*• -L. J- ^ - § » 


1 

e 

" ! • 




1 23456789 10 
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After 10th iteration 



Figure 2. PMC sampling of | |16| l as a function of iteration over 
100 simulation runs. Top left: Estimates of the perplexity l |12| ); 
Top right: Effective sample size (14) ; both normalized by TV. Bot- 
tom: Estimates of the evidence at each iteration (left), and after 
the 10 th iteration using TV = 100,000 sample points (right). The 
true value is indicated as a dashed line. See Fig. [T] for details 
about the whisker plot representation. 



d — 10. After the final iteration (Fig. [2] bottom right), the 
estimate of the evidence is In E = -0.0019 ± 0.0038 at 68% 
confidence. The presence of a slight downward bias from the 
true value of zero is not unexpected due to the use of the 
log-scale (as a consequence of Jensen's inequality, an unbi- 
ased estimate of E would look biased when plotted on the 
log-scale) . This could also be due to the importance function 
not fully exploring the low probability tails of the posterior 
density. However in the case of Fig. [2] the bias in terms of 
the scale is very small, and is associated with equally small 
variability of the estimates of E. 



4 COSMOLOGY 

The so-called standard model of cosmology is successful 
in explaining recent observations of cosmology, such as 
the CMB, supernovae of type la (SNIa), galaxy cluster- 
ing including baryonic acoustic oscillations (BAO), cos- 
mic shear, galaxy cluster counts, and Lya forest cluster- 
ing. This flat ACDM model has only six free parameters 
(f2 m , fib, h, n s , t, as, or functions thereof) and is therefore 
surprisingly simple. 

Despite this, various extensions to the standard model 
are considered and tested routinely using observational data. 
These extensions may be based on independent evidence (for 
example massive neutrinos from oscillation experiments), be 
predicted by a compelling hypothesis (primordial gravita- 
tional waves from inflation) or reflect our ignorance about 
the fundamental physics (dynamical dark energy). What- 
ever be the case, future surveys and analyses are to answer 
the question which of the many models is the one to best de- 
scribe the observations. So far, no extension of the standard 
model has been strongly supported by the data. 
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In this paper, we use the Bayesian evidence as a tool 
to compare different models and their ability to describe 
cosmological data. As described in Sect. |2.3] we use PMC to 
sample the posterior and to calculate the evidence. We take 
recent data of CMB ( |Hinshaw et al.|2009|>, SNIa ( |Kowalski 
et al. 120081) and BAO (jEisenstein et al. 120051). The extensions 



to the standard model of cosmology concern dark energy and 



curvature (Sect. 4.2), and inflationary models (Sect. 4.3 1 



4.1 PMC set-up 

To set up PMC we have to choose the initial proposal, q 1 , the 
number of sample points, N, and the number of iterations, 
T. We take q 1 to be a Gaussian mixture model with D com- 
ponents, which are displaced from the maximum-likelihood 
point by a random shift /shift in each dimension, with /shift 
being the fraction of the prior parameter range. The covari- 
ance of the components corresponds to the Fisher matrix 
rescaled by a number / V ar, typically of order unity, or larger 
if the Fisher matrix is suspected to be significantly narrower 
than the posterior curvature. 



For the dark-energy and curvature models (Sect. 4.2), 
we choose the number of iterations T to be 10. If after 10 
iterations the perplexity is still low, say, smaller than 0.6, 
we run PMC for more iterations. The choices of the N and 
D are linked: the average number of points sampled under 
an individual mixture-component, N/D, should not be too 
small, to ensure a numerically stable updating of this com- 
ponent. We choose N = 7 500 and D = 10. 



For the primordial models (Sect. 4.3 1 we take T = 
5, TV = 10 000 and D between 7 and 10, depending on the 
dimensionality and shape of the likelihood. 

The parameters controlling the initial mixture means 
and covariances, are chosen for both cases to be /shift = 
0.02, and / var between 1 and 1.5. Those values are educated 
guesses and can be refined according to the evolution of the 
proposal components during the first few iterations. E.g., if 
many components die off early, they start too far from the 
high-posterior region and /shift should be decreased. For the 
final iteration we choose a five-times larger sample than for 
previous iterations. 



Table 2. Prior ranges for dark energy and curvature models. 
In case of w(a) models, the prior on Wi depends on wo, see 

sect, ion 



Parameter 


Description 


Min. 


Max. 




Total matter density 


0.15 


0.45 




Baryon density 


0.01 


0.08 


h 


Hubble parameter 


0.5 


0.9 


n K 


Curvature 


-1 


1 


wo 


Constant dark-energy par. 


-1 


-1/3 


U'l 


Linear dark-energy par. 


-1 - w 


— 1/3— too 
1-aacc 



priors for those three parameters; the prior ranges for all 
parameters can be found in Table [2] We verified that the 
relative evidence between models, or Bayes factor, does not 
depend on the prior ranges for nested parameters if the 
high-density likelihood region is situated far from the prior 
boundaries. 



4-2.1 Dark-energy prior 

The simple parametrization of w clearly is not motivated 
by fundamental physics of dark energy. However, this choice 
represents the most simple models which go beyond a cos- 
mological constant; it therefore makes sense to use those 
extensions in a model-selection framework. To define a phys- 
ically sound prior of these dark-energy parameters, we re- 
strict ourselves to a specific class of models. Our goal is to 
find a model which is able to explain the observed, recent 
accelerated expansion of the Universe. The model should 
therefore include a component to the matter-density tensor 
with w(a) < —1/3 for values of the scale factor a > a acc . 
We choose a acc = 2/3. To limit the equation of state from 
below, we impose the condition w(a) > —1 for all a, thereby 
excluding phantom energy as in Efstathiou (20081. Fig. [4] 



shows the allowed range in the case of two dark-energy pa- 
rameters. We note that our approach is inconsistent to some 
extend in that the data on which the observation of accel- 
erated expansion is based on is part of the data used in this 
analysis. 



4.2 Dark energy and curvature 

Here we test the standard ACDM-model assumptions of a 
cosmological constant and flatness. We parametrize the dark 
energy equation-of-state parameter as constant and as linear 
function in the scale factor a, respectively. Together with the 
basic model for which w — —1, we compare the three cases: 



w = —1 

W = Wo 

W = Wo ~\~ Wl (1 — d) 



ACDM 

wCDM (20) 
iu(z)CDM 



In addition, the curvature parameter Qk for each of the 
above models is either Qk = ('flat') or Qk ('curved'). 

We do not take into account dark energy clustering. The 
observational data is reduced to purely geometrical probes 
of the Universe; for CMB these are the distance priors ( |Ko-| 
matsu et al.||2009 ) and for BAO the distance parameter A 
( jEisenstein et aT" |2005| |. The common parameters for all 
models are Q-m,Qb and h. All models share the same flat 



4-2.2 Curvature prior 

A natural limit on the curvature is that of an empty universe; 
this certainly places an upper boundary on the curvature, 
corresponding to Qk = 1. A lower boundary, correspond- 
ing to an upper limit on the total matter-energy density, 
is less stringent. We choose Qjf > —1; this 'astronomer's 
prior' (Varda nyan et aL||2009[ ) provides a symmetric prior 
around the null-hypothesis value and excludes high-density 
universes which are ruled out by the age of the oldest ob- 
served objects. 

An alternative prior on Qk could be derived from the 
paradigm of inflation. However, most inflationary scenarios 
imply the curvature to be extremely close to zero, on the 
order of 10" 60 . The likelihood over such a prior on Qk is 



essentially flat for any current and future (Watcrhou.se k, 
Zibin|[2008 1 experiment. A model with such an uninforma- 
tive likelihood would be indistinguishable in terms of the 
Bayesian evidence with respect to a flat model. 
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Figure 3. Evidence for various models OTi with respect to the 
reference flat ACDM model 2K 2 with 

TZpar — 3. The different com- 
binations of data are CMB (blue triangles), CMB+SNIa (green 
circles) and CMB+SNIa+BAO (red squares). Note that the Bayes 
factor between different non-reference models can only be com- 
pared for the same combination of data (same symbols). 




-1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 
WO 

Figure 4. 68%- and 95% confidence regions for WMAP 
(solid blue lines), WMAP+SNIa (dashed green) and 
WMAP+SNIa+BAO (dotted red curves). The allowed range for 
the dark-energy parameters wq and wi lies between the two red 
straight lines. 



4.2.3 Results 

In Fig. [3] we plot the Bayes factor for various models with 
respect to the standard flat ACDM model. In most cases 
there is positive evidence in favour of the standard model. 
This evidence against more complex models increases if more 
probes are combined. This is not surprising: no deviation 
from w = — 1 and Qk = has been found, additional pa- 
rameters are not supported by the data. The more data are 
added, the tighter get the constraints around the standard 
values, therefore the stronger gets the evidence in favour of 
this simplest model. 

The largest positive evidence is ln_Bi2 = 1.8, for the 
w(z)CDM model and CMB alone. In this case, as can be seen 
in Fig. |4j a large part of the prior range is still allowed by 
the data, and a region of comparable size is excluded. There 
is weak evidence that the two extra-parameters wo and wi 
are indeed required by the data. When adding SNIa and 
BAO, most of the prior range is excluded, and this 'waste' of 
parameter space is penalised by decreasing the Bayes factor. 

Regarding the prior on u>o, our 10CDM model corre- 
sponds to Model II from Serra et al. (20071. In that work, 



SN data alone led to a lnBi2 of around —0.2 (comparing 
ACDM to Model II). In our case, the combination of SN 
with CMB and BAO leads to a larger evidence in favour of 
ACDM. 

Our results on the curvature are compatible with the 
findings of Vardanyan et al. (20091. Using all three data 



sets, a non-flat universe is strongly disfavoured for all three 
dark-energy cases. For comparison, |Vardanyan et alT] ( |2009[ ) 
showed that using a flat prior in log ilpc, corresponding to a 



flat prior on the curvature scale, therefore largely increasing 
the prior volume, leads to inconclusive evidence. 



4-2.4 Stability of the results 

We test the reliability of the results for two of the cases 
presented in Sect. |4.2| 10CDM flat and wCDM curved, both 
using CMB+SN+BAO data. We repeat the PMC runs 25 
times. For a given scheme with fixed proposal parameters 
/shift, /var and D (Sect. |4~Tj |, we randomly vary the positions 
and widths of the initial proposal components. 

The distribution of the log-evidence InE and the nor- 
malized perplexity is shown in Fig. [5] for two cases of dark- 
energy models. Since the components of the proposal move 
towards the tails of the posterior with progressing iteration, 
the evidence keeps on increasing with better sampling of the 
tails. The high value at the first iteration is biased and dom- 
inated by a few points with very large importance weights 
w n = "K(x n ) I 'q(x n ), which are sampled from the proposal 
tails but lie in regions of high posterior density 7f . 

For the flat wCDM model the evidence converges after a 
few iterations showing a very small dispersion between runs 
with InE = -9.159 + 0.011 (68% confidence) at t = 10. The 
relatively large perplexity of p w 0.6 - 0.7 indicates a reliable 
sampling of the posterior. The posterior of the curved model 
is more elongated which makes the sampling more difficult. 
The perplexity does not exceed 0.4 even after 20 iterations. 
The evidence however stabilises onto a narrow interval after 
~ 18 iterations, hxE = -10.84 ± 0.077 (68% confidence) at 
t = 20. 
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Figure 5. Distribution of 25 PMC samplings of two dark-energy 
models, flat u>CDM {left panels) and curved toCDM {right pan- 
els). Top (Bottom): The log-evidence (perplexity) as function of 
iteration. See Fig. [T] for details about the whisker plot represen- 
tation. 



4.3 Primordial perturbations 

In the following, we test models corresponding to various de- 
scriptions and parametrizations of primordial fluctuations. 
The (dark-matter) density fluctuations are given by the 
power spectrum as function of scale fc, 



P s (k) oc k 



n s +2 aB ln ( fc / fc o) 



(21) 



with the parameters n s being the scalar spectral index, 
and q s the 'running' of the index, i.e. the first-order Tay- 
lor term of the exponent. The pivot scale ko is fixed to 
fc* = 0.002 Mpc~\ In addition, tensor-modes (gravitational 
waves) have the power spectrum 



Pt(fc) oc k TI 



(22) 



with tensor spectral index n t . The ratio between tensor and 
scalar perturbation spectra at scale ko is denoted by r. In the 
standard model, a s = n t — r = 0, only n s is a free parame- 
ter. Most inflationary models predict n s to be slightly below 
unity, therefore the power spectrum of primordial density 
perturbations is a near scale-free power law. Tensor pertur- 
bations (gravitational waves) are expected to be non-zero, 
but their amplitude is unknown and current data have not 
been able to detect those modes. 

Although tensor-modes are expected from most models 
of the early Universe, they are not detected so far with the 
given sensitivity of current data. 

The models we consider for our Bayesian evidence anal- 
ysis are interpreted within the slow-roll approximation of 
inflation, as will be described in the next section. 



4-3.1 Slow-roll parameters 

The slow-roll approximation of inflation provides an infinite 
hierarchy of flow equations describing the dynamics of the 
single scalar field which drives inflation (see |Peiris fc Easther| 
and references therein) . The slow-roll parameters e and 
^ 1 are defined in terms of the potential V of the 
and the Hubble parameter H, 



2006 
scalar field 



A/ 





'H' 


47T 


H 


/ 2 


\ t 


/ mpi \ 


\ An 





H 



W 



(23) 
(24) 



where the prime denotes derivation with respect to <f>. The 
Planck mass is denoted by mp\. The hierarchy of flow pa- 
rameters can be truncated, since if some l Xh = 0, all higher 
terms ( \h,£ > L vanish. We consider the expansion up to 
first-order and set 1 \h = Vi 2 ^h = 0. The parameters of 
the primordial power spectra can be written in terms of the 
slow-roll parameters as 



1 - e 
-2e- 



2(1 +C)e 2 

(IO97 ~ 8e) ; 
(3 + C)6 2 + (l+C)e77 



1 + 2?7 - 4e - 

16e [l + 2C(e 
c 



5C)e V ; 



(25) 
(26) 
(27) 
(28) 



Here, C = 4(ln2 + 7) - 5 » 0.0814514 where 7 = 0.577216 
is the Euler-Mascheroni constant. For slow-roll inflation to 
take place, the slow-roll conditions e< 1 and | Xh\ <C 1 for 
all I have to be satisfied. 



4.3.2 Priors 

We use the slow-roll conditions to define priors on the pri- 
mordial parameters as ^ e ^ 0.1 and |?7| ^ 0.1. Al- 
though the exact values of the prior boundaries are some- 
what arbitrary, they have been considered by Ma rtin fc] 
|RingevaT| ( |2006| ) as natural limits for the validity of the 
Taylor-expansion of the power spectrum P{k) in ln(fc/fc*) 
around the pivot scale fc* = 0.05 Mpc -1 . We use fc* = 0.002 
Mpc -1 as pivot, in accordance to the WMAP5 analysis; we 
verified the equivalence of our results for both cases of fc* . We 
choose an uninformative (i.e. flat) prior on the slow-roll pa- 
rameters. From eqs. [ 25|28 l we get the corresponding ranges 
of the power-spectra parameters, see Table|3j which are now 
motivated from fundamental physical principles within the 
slow-roll model of inflation. We choose flat priors for the 
power-spectra parameters as well, although they are non- 
linearly related to the slow-roll parameters and the prior 
will have a different shape. However, we ignore this for sim- 
plicity. See 



Trotta 



(20071 for a similar approach to define 
a prior on the spectral index tilt. The tensor index n t is 
unconstrained by current data; therefore, we do not include 
this parameter. 

We compare various models to the standard paradigm, 
which now has six parameters (Jim, fib, h, n s , 10 9 A^, r). For 
our model testing, we single out individual parameters or 
combinations thereof. A strictly consistent and thorough 
treatment should treat the slow-roll parameters as primary 
parameters; this will be left for a future analysis. 
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Table 3. Prior ranges for primordial model comparison. The prior 
ranges for primordial parameters are derived from the slow-roll 
approximation. 



Evidence (ref. model M 2 : ACDM flat n s =const) 



Parameter 


Description 


Min. 


Max. 




Total matter density 


0.01 


0.6 


fib 


Baryon density 


0.01 


0.1 




Reionisation optical depth 


0.01 


0.3 




Normalisation 


1.4 


3.5 


h 


Hubble parameter 


0.2 


1.4 


n s 


Scalar spectral index 


0.39 


1.2 


a s 


Running of spectral index 


-0.2 


0.033 


r (lin. prior) 


Tensor-to-scalar ratio 





1.65 


lnr (log. prior) 


Tensor-to-scalar ratio 


-80 


0.50 



4.3.3 Results 

In Fig. [6] we show the Bayes factor of various models 97ti 
with respect to the standard model a flat ACDM uni- 
verse with n s — const. A running spectral index is favoured 
weakly, all other cases are disfavoured. The evidence against 
the Harrison-Zel'dovich model (n s = 1) is weak, whereas 
tensor perturbations are moderately disfavoured. For illus- 
tration, we include a tensor-mode model with flat prior for 
lnr instead of r; the minimum is chosen to be -80, corre- 
sponding to the energy scale of Big Bang Nucleosynthesis 
as a conservative lower limit of the inflation energy scale 
( jParkinson et al7||2006[ >. The large prior of the logarithmic 
tensor-to-scalar ratio causes this model to be strongly dis- 
favoured. Note however that this example is not consistent 
with flat priors for the slow-roll parameters; it rather corre- 
sponds to a model in which very small slow-roll parameters 
are much more likely than large ones. 

Previous results showed both positive as well as nega- 
tive evidence for a scale-free, Harrison-Zel'dovich (HZ) spec- 



trum compared to a tilted power law ((Bridges et al. 2006 



Mukherjee et al.||2006| |Bridges et "aI1[2007| |Trotta||2007|) 



The evidence in either direction is however moderate at 
most. More narrow priors on n s increase the evidence for 
the HZ model. Our result is in agreement with those pre- 
vious works, taken into account the differences in the em- 
ployed data and priors. This shows that the evidence against 
a scale-free spectrum is not yet substantial with the current 
data, and it calls for physically motivated prior density. 

Previous results showed positive and negative evidence 
for a non-zero running of the spectral index a s , depending 
on the prior width (Brid ges et al.|2 007). Our positive value 
of +1.73 is relatively high but still corresponds to only weak 
evidence in favour of a s 7^ 0. 



5 CONCLUSION 

The Bayesian evidence E provides a mathematically consis- 
tent and intuitive tool to compare different models and to 
choose between competing models. Its calculation in high- 
dimensional parameter spaces is in general numerically chal- 
lenging, as the posterior distribution may be multimodal 
and/or show strong non-linear parameter-dependencies. 

Simplification methods such as the Laplace approxima- 
tion are not sufficient and cannot replace the full integration 
of the posterior over the parameter space. Laplace assumes 



£ -4 
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running 



n s =const 



HZ 



running+tensor- 



tensor 



tensor(ln prior) 



par 



Figure 6. Evidence for various models OTi with respect to the 
reference model OT2, a flat ACDM universe with constant n s . 



a multivariate (i.e. single-peaked) Gaussian likelihood and 
a prior which is much wider than the likelihood. If either of 
these assumptions is violated, the approximation can give 
values of In E which are wrong by several dex. Although in 
many cases the likelihood might be close to Gaussian, or 
can be brought in such a form by parameter transforma- 
tions, this is not possible in general, and often there exist 
hard, physical priors which cut off the likelihood. 

In this paper, we use Population Monte Carlo (PMC) 
to estimate the Bayesian evidence. PMC is a new, adaptive 
importance sampling method which offers an efficient and 
reliable way to obtain the Bayesian evidence for non-trivial 
likelihood shapes. An expression for the variance of the ev- 
idence estimator is introduced. We will leave a study of an 
estimator of this variance for future work. Here, we repeat 
PMC runs to study the distribution of the evidence, and 
show the robustness of the results. If the initial proposal is 
badly chosen the estimate of E can be significantly off; such 
cases can however be identified by the in-built diagnostic 
tools of PMC, e.g. the perplexity and effective sample size. 
In such a case, the perplexity would remain very low and 
never reach values close to its maximum of unity. Additional 
indications of an unsuited initial proposal is the vanishing of 
most mixture component. Those components do not cover 
part of the high-density posterior region, and it is very un- 
likely that the remaining components will well approximate 
the posterior. An illustration of a well-behaving case can be 
seen in Figure 8 of WK09, where the components spread out 
nicely until they remain stationary after a few iterations. 

Other methods to estimate the Bayesian evidence have 
been proposed, e.g. vegas ( |Lepage| |1978| |Serra et al.| 
|2007[ ) which necessitates that the likelihood function can 
be brought into a separable form. Another computation- 
ally efficient since to some extent parallelizable algorithm 



is (multi-)nested sampling (Skilling 2006 Feroz & Hobson 
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2008} |Feroz et al.||2009|. While being a sp ecial case of im- 
portance sampling (Robert & Wraith 2009 1, nested sampling 



draws from the prior instead from an importance function 
approximating the posterior as in PMC. See |Clyde et aT] 



(20071; Marin & Robert (2325 I and Robert & Wraith (2009) 



for an overview of various methods of Bayesian evidence es- 
timation including Markov chain Monte Carlo methods. 

We have applied Bayesian model selection to two do- 
mains of cosmology, the accelerated expansion of the Uni- 
verse in the recent past and primordial fluctuations in 
the early Universe. For the former, we analysed simple, 
parametrized models of dark energy; the latter used the 
slow-roll approximation of inflation. We employed recent 
cosmological data corresponding to CMB, SNIa and BAO. 

No dark-energy model is strongly or even only moder- 
ately favoured over the standard ACDM paradigm. This is in 
spite of the rather strong prior on dark-energy, i.e. excluding 
phantom energy and requiring an accelerating component 
in the recent past. More general dark-energy models with 
larger parameter spaces will likely be disfavoured with re- 
spect to ACDM. This is even true if future experiments find 
deviations of w from -1 unless the error bars get extremely 
small (Lindley- Jeffrey paradox). This should serve as a mo- 
tivation to define a tight physical framework for dark-energy 
models with stronger prior constraints on parameters. 

We find strong evidence against a non-flat universe us- 
ing the combined CMB+SNIa+BAO data. This is true re- 
gardless of the chosen dark-energy model. It holds for a prior 
belief that \Q K \ ^ 1. 

We use a natural limit on slow-roll inflation param- 
eters to deduce prior ranges for the primordial perturba- 
tion spectra parameters. The preferred model contains a 
non-zero running spectral index (a s 7^ 0) and no tensor 
modes (r = 0). The evidence is however only weak. A scale- 
free, Harrison-Zel'dovich model is weakly disfavoured. Ten- 
sor modes are moderately to strongly disfavoured, depend- 
ing on the prior shape on r. As a consequence, future de- 
tections of tensor modes have to be done with very high 
significance, to strongly disfavour a r = model. 
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